Neural Networks

Reverse-engineering the brain

Overview of Algorithms

  • Perceptrons
    • Linear classifier based on neuron
  • Gradient Descent
    • Essential part of many kinds of learning algorithms
  • Feedforward (multilayer) networks
    • Simplest deep learning algorithm
  • Backpropagation
    • Most widely used algorithm for training neural nets

Brain vs CPU

  • Lots of research into mapping and studying the brain
  • $\sim$ 10^10 neurons, switching in $\sim$ .001s (transistor $\sim$ GHz)
  • $\sim$ 10^5 connections per neuron (transistor $\sim$ 10)
  • Consumes $\sim$ 40W
  • Recognizes a scene in $\sim$ .1s (massive parallelism!)
  • Visual cortex (V1) $\sim$ 140 million neurons
    • Next cortices (V2-V5) do increasingly complex processing

Example: Character recognition

In [11]:
library(OpenML)
mnist_data = getOMLDataSet(did = 554)$data # MNIST dataset from OpenML
par( mfrow = c(10,10), mai = c(0,0,0,0)) # 10x10 grid
for(i in 1:100){ # Convert first 100 rows to 28x28 matrix and plot 
    y = matrix(as.matrix(mnist_data[i, 1:784]), nrow=28)
    image( y[,nrow(y):1], axes = FALSE, col = gray(255:0 / 255))}

How neurons work

  • Structure: dendrites, cell body, axon, synapses

A Neuron

  • Neuron triggers, sends action potential (voltage) through axon
  • Transmission across synapse is chemical
    • Action potential causes release of neurotransmitters
    • Diffuse over to other side, create signal in next dendrite
    • Synapse connection can be weak or strong
  • Hebbian learning: neurons that fire together, wire together
    • Synapses grow stronger through experience
    • Long-term potentiation: more dendritic receptors

Perceptron: Representation

In [4]:
perceptron()
  • $p$ binary inputs $x_1, \ldots, x_p$
  • One binary output $y = f(x)$
  • Single neuron; connection $i$ has weight $w_i \in \mathbb{R}$
  • Neuron receives weighted sum of inputs
  • Threshold function $\sigma$ defines when neuron fires $$f(x) = \sigma(\sum_{i=1}^p w_i x_i) = \sigma(w \cdotp x) = \begin{cases} 1, & w \cdotp x \geq \theta \\ 0, & \text{otherwise} \end{cases} $$
  • Mathematical simplification: add bias $w_0 = -\theta$, $x_0=1$ $$f(x) = \sigma(\sum_{i=0}^p w_i x_i) = \sigma(w \cdotp x) = \begin{cases} 1, & w \cdotp x \geq 0 \\ 0, & \text{otherwise} \end{cases} $$
In [5]:
px <- seq(-6, 6, length.out = 101)
plot(px, px>=0, type = "l", xlab = "", ylab = "", main = "Perceptron response")

What can perceptrons learn?

  • Can it learn logical gates (AND, OR, XOR)?
  • Shall we try MNIST?
    • 784 inputs, lots of weights to learn.
In [6]:
mnist_bin = droplevels(subset(mnist_data, class %in% c(0,1))) #MNIST with only 0 and 1 (binary)
# Now, plot output for 2 random pixels
ggplot(data=mnist_bin, aes(x=pixel263, y=pixel518, color=class)) + geom_point()
In [7]:
library(mlr)
mnist_task = makeClassifTask(data = mnist_bin, target = "class")
# We're mimicking a perceptron by constructing a neural net with 1 hidden node
# Actually this is a sigmoid perceptron, but still quite close
perceptron = makeLearner("classif.nnet", par.vals = list(size = 1, trace = FALSE))
plotLearnerPrediction(perceptron, mnist_task, features = c("pixel263", "pixel518")) + theme_cowplot()

Perceptrons can only learn linear functions

  • Indeed, the output depends only on the sign of a linear function: $$\sum_{i=0}^p w_i x_i = w_0 x_0 + w_1 x_1 + \ldots + w_p x_p $$
  • In $n$ dimensions, perceptron can only learn hyperplanes
  • Simple concepts like XOR cannot be learned
    • Requires a 2-layer perceptron
  • Hence, we need more than 1 neuron, we want networks of these

Perceptron: Optimization (training)

  • Perceptron $\sim$ device that makes decisions by weighing up evidence
  • You have to learn how much each input affects the desired outcome
  • Like a social network: should I go see movie X?
    • Each friend possibly influences your decision
    • You learn whose recommendations to trust more than others
  • How to learn those weights?

Perceptron training rule

  • Initialize weights randomly
  • For each training example $(x, t)$, update every weight: $$w_i \leftarrow w_i + \Delta w_i$$ where $$\Delta w_i = \eta(t-o)x_i$$ with:
  • $t = c(x)$: target value
  • $o$: perceptron output
  • $\eta$: small constant (e.g. 0.1) called the learning rate
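As a hedged illustration of the rule above (in Python rather than the R used elsewhere in these slides), a perceptron can learn a linearly separable concept such as the AND gate; the gate, learning rate, and epoch count below are arbitrary choices for the sketch:

```python
# Minimal sketch of the perceptron training rule: learn the AND gate.
def predict(w, x):
    # x includes the bias input x_0 = 1; fire iff w . x >= 0
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def train(examples, eta=0.1, epochs=50):
    w = [0.0, 0.0, 0.0]  # w_0 (bias), w_1, w_2
    for _ in range(epochs):
        for x, t in examples:
            o = predict(w, x)
            # Delta w_i = eta * (t - o) * x_i
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Each example is ([x_0=1, x_1, x_2], target)
AND = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train(AND)
print([predict(w, x) for x, _ in AND])  # -> [0, 0, 0, 1]
```

Since AND is linearly separable, the rule converges; trying the same code on XOR targets will not.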

Why does this work?

  • If outputting the correct value: do nothing
  • If outputting an incorrect value:
    • Predict 0 when target is 1: $t-o=1$
      • Increase weights of active inputs ($x_i = 1$)
    • Predict 1 when target is 0: $t-o=-1$
      • Decrease weights of active inputs ($x_i = 1$)

Convergence

  • Any guarantee that it will converge?
  • Proof that it will converge if:
    • $\eta$ is sufficiently small
    • Training data is linearly separable
      • with noise, no convergence
      • $\Delta w_i$ will move hyperplane towards noise, misclassify other points
      • 'solution': stop after $t$ iterations

Gradient descent

  • Still learn hyperplane, but differently
    • Most robust, and generalizes to complex networks
  • First, consider a linear unit: $$o = w_0 + w_1 x_1 + \ldots + w_p x_p $$
  • Let's learn weights that minimize the squared error: $$E[w] = \frac{1}{2} \sum\limits_{d=1}^{n} (t_d-o_d)^2 $$ with $n$ training examples
  • The squared error of a linear function is a quadratic function of the weights
    • Much easier to optimize (single global minimum, no local minima)
    • True for a single linear unit (not for general networks)

How do you find optimum?

In [8]:
ws = seq(0,4,len=20) # for plotting
f = function(w) { (w-2)^2 } # function to optimize
plot(ws , f(ws), type="l") # Plot target
w = c(0.1,0.1-0.1*2*(0.1-2)) # We'll see later how to compute this
lines (w, f(w), type="b",col="blue")
  • Gradient: $$\nabla E[w] = \left[ \frac{\delta E}{\delta w_0} , \frac{\delta E}{\delta w_1} , \ldots, \frac{\delta E}{\delta w_p} \right]$$
  • Vector where each component is partial derivative of error with respect to that weight
    • Large value: move in that direction a lot
  • Training rule: $$ \Delta w = - \eta \nabla E[w] $$ i.e. $$ \Delta w_i = - \eta \frac{\delta E}{\delta w_i} $$
In [9]:
grad = function(w){2*(w-2)} # gradient df/dw 
 
w = 0.1 # initialize (first guess)
learningRate = 0.8 # try smaller and larger values
wtrace = w # initialize (first guess)
ftrace = f(w) # store y-values
for (step in 1:100) {
w = w - learningRate*grad(w) # gradient descent update
wtrace = c(wtrace,w) # store next x-value
ftrace = c(ftrace,f(w)) # store next y-value
}

plot(ws , f(ws), type="l") # Plot target
lines( wtrace , ftrace , type="b",col="blue") # Plot steps

Computing the gradient

\begin{align} \frac{\delta E}{\delta w_i} &= \frac{\delta}{\delta w_i} \frac{1}{2} \sum_d (t_d-o_d)^2 \\ & = \frac{1}{2} \sum_d \frac{\delta}{\delta w_i} (t_d-o_d)^2 \\ & = \frac{1}{2} \sum_d 2 (t_d-o_d) \frac{\delta}{\delta w_i} (t_d-o_d) \\ & = \sum_d (t_d-o_d) \frac{\delta}{\delta w_i} (t_d-w \cdotp x_d)\\ & = \sum_d (t_d-o_d) (-x_{i,d}) \end{align}

Gradient descent

  • Initialize each $w_i$ to some small random value
  • Until termination condition is met:
    • Initialize each $\Delta w_i$ to 0
    • For each $(x,t)$ in training examples, do: $$\Delta w_i \leftarrow \Delta w_i + \eta(t-o)x_i$$
    • For each linear unit weight $w_i$, do: $$w_i \leftarrow w_i + \Delta w_i $$
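The loop above can be sketched as follows (an illustrative Python version, not the slides' R code), fitting a linear unit to data generated exactly from $t = 1 + 2x$:

```python
# Batch gradient descent for a single linear unit o = w0 + w1*x.
# Data comes from t = 1 + 2*x, so the optimum is w0 = 1, w1 = 2.
data = [(x / 10, 1 + 2 * (x / 10)) for x in range(11)]  # (x, t) pairs
w0, w1, eta = 0.0, 0.0, 0.05

for step in range(2000):
    d0 = d1 = 0.0
    for x, t in data:            # accumulate Delta w over ALL examples
        o = w0 + w1 * x
        d0 += eta * (t - o) * 1  # x_0 = 1 (bias input)
        d1 += eta * (t - o) * x
    w0, w1 = w0 + d0, w1 + d1    # one batch update

print(round(w0, 2), round(w1, 2))  # -> 1.0 2.0
```

The learning rate and step count are arbitrary; too large an $\eta$ makes the updates diverge.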
In [18]:
# Print the gradient descent trace. Note that the implementation does multiple restarts.
perceptron = makeLearner("classif.nnet", par.vals = list(size = 1, trace = TRUE))
plotLearnerPrediction(perceptron, mnist_task, cv = 0, features = c("pixel263", "pixel518")) + theme_cowplot()
# weights:  5
initial  value 9088.225535 
iter  10 value 3424.542088
iter  20 value 3362.401146
final  value 3362.376301 
converged

Gradient descent: conclusions

  • Guaranteed to converge to the hypothesis with minimum squared error
  • Given a sufficiently small learning rate $\eta$
  • Even when training data contains noise
  • Even when training data is not separable by $H$
  • In general, there is no closed form solution for $\frac{\delta E}{\delta w_i} = 0$, so we need gradient descent

Stochastic gradient descent

  • Batch mode: look at all examples, compute gradient, update by that gradient
    • Ensures guarantees we just discussed
  • Stochastic gradient descent (incremental mode):
    • Update weights after each training example
    • Doesn't go straight to goal, but 'zigzags' towards it
  • Works well in practice
    • Faster: you do more updates (even if they zigzag)
    • Noise may be helpful: avoids small local optima (bouncing ball)

Refining the model of the neuron

  • Action potentials are not big jolts, but short-lived spikes
  • Neuron spikes more frequently the more it is stimulated
    • spike frequency $\nu$
    • saturates at max. frequency
  • The amount of stimulation is called membrane potential $u$
  • Spike frequency is non-linear function of membrane potential
    • Sigmoid function, or S-curve
    • Most important curve in the world: phase transitions $$\nu = \sigma(u) = \frac{1}{1+e^{-u}}$$
In [46]:
library(pracma)
plot(px, sigmoid(px, a = 1), type = "l", 
        ylab = "spiking frequency (normalized)", xlab = "amount of stimulation", main = "Sigmoid")

Neural net

  • Representation
    • Many neuron-like units
    • Many weighted interconnections
    • Highly parallel processes
  • Optimization
    • Learning: tune weights automatically

Let's start with a simpler dataset

In [48]:
library(mlbench)
spirals = as.data.frame(mlbench.spirals(n=1000, cycles=1.5, sd=0.05))
spiral_task = makeClassifTask(data = spirals, target = "classes")
ggplot(data=spirals, aes(x=x.1, y=x.2, color=classes)) + geom_point() + coord_fixed(ratio=1)

The perceptron can't learn it

In [49]:
# The perceptron again
mlp = makeLearner("classif.nnet", par.vals = list(size = 1, trace = FALSE))
plotLearnerPrediction(mlp, spiral_task, features = c("x.1", "x.2")) + theme_cowplot()

We can build a neural net, with 1 layer of hidden nodes

  • Black = positive weights, gray = negative, thickness = magnitude
  • B1, B2,... are the bias layers
In [50]:
library(nnet)
mod1 <- nnet(classes ~ ., data = spirals, size = 10, trace = FALSE)
plot.nnet(mod1)
In [51]:
# 1 hidden layer, 10 nodes
mlp = makeLearner("classif.nnet", par.vals = list(size = 10, trace = FALSE))
plotLearnerPrediction(mlp, spiral_task, features = c("x.1", "x.2")) + theme_cowplot()
In [52]:
# 1 hidden layer, 100 nodes
mlp = makeLearner("classif.nnet", par.vals = list(size = 100, trace = FALSE))
plotLearnerPrediction(mlp, spiral_task, features = c("x.1", "x.2")) + theme_cowplot()

Learning the network weights

  • With a network of neurons, we could learn much more complex functions
  • But how do we train them?
    • The perceptron rule does not generalize to networks
  • Gradient descent!
    • But, network of linear functions is still a linear function
    • We need a non-linear function in the nodes
    • Preferably one that is easily differentiable
    • And inspired by the brain
In [53]:
plot(px, sigmoid(px, a = 1), type = "l", 
        ylab = "Output", xlab = "Weighted sum of inputs (u)", main = "Sigmoid")

Nice property: $$\frac{\delta \sigma(u)}{\delta u} = \sigma(u)(1-\sigma(u))$$

  • When we do gradient descent, we can replace differentials easily
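For instance, a quick numerical check of the identity (a Python sketch, separate from the R slides): a central finite difference of $\sigma$ should agree with $\sigma(u)(1-\sigma(u))$.

```python
# Verify d sigma / du = sigma(u) * (1 - sigma(u)) numerically.
import math

def sigmoid(u):
    return 1 / (1 + math.exp(-u))

u, h = 0.7, 1e-6
numeric  = (sigmoid(u + h) - sigmoid(u - h)) / (2 * h)  # central difference
analytic = sigmoid(u) * (1 - sigmoid(u))
print(abs(numeric - analytic) < 1e-6)  # -> True
```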

Computing the gradient for sigmoid units

\begin{align} \frac{\delta E}{\delta w_i} &= \frac{\delta}{\delta w_i} \frac{1}{2} \sum_d (t_d-o_d)^2 \\ & = \frac{1}{2} \sum_d \frac{\delta}{\delta w_i} (t_d-o_d)^2 \\ & = \frac{1}{2} \sum_d 2 (t_d-o_d) \frac{\delta}{\delta w_i} (t_d-o_d) \\ & = \sum_d (t_d-o_d) \frac{\delta}{\delta w_i} (- \sigma(u_d))\\ & = - \sum_d (t_d-o_d) \frac{\delta \sigma(u_d) }{\delta u_d} \frac{\delta u_d}{\delta w_i}\\ & = - \sum_d (t_d-o_d) \sigma(u_d)(1-\sigma(u_d))\, x_{i,d} \end{align}$$ u_d = w \cdotp x_d $$

Backpropagation

  • Propagate example through the network
  • Back-propagate the error signal from output back through the layers
    • How does changing weight $w_{j,k}$, from node $j$ into neuron $k$, change the final error? $\frac{\delta E}{\delta w_{j,k}}$
  • To compute those, we introduce an intermediate quantity: the 'error' in the signal of neuron $k$: $$ \delta_k = - \frac{\delta E}{\delta u_k} $$
  • The errors $\delta_k$ can be computed based on the errors in the next layer
  • Order of computation is back to front, ultimately relating $\delta_k$ to $\frac{\delta E}{\delta w_{j,k}}$

Understanding $\delta_k$

  • Imagine a demon that adds a little change $\Delta u_k$ to the input of neuron $k$
    • The neuron now outputs $\sigma(u_k + \Delta u_k)$ instead of $\sigma(u_k)$
    • Propagates through network, ultimately causing an error $\frac{\delta E}{\delta u_k} \Delta u_k$
  • A good demon helps you improve the error by trying to find a $\Delta u_k$ that reduces the error
    • If $\frac{\delta E}{\delta u_k}$ is large, choose $\Delta u_k$ to reduce it

Tampering with the weights creates an error downstream.

For hidden node $h$, one layer before nodes $k$: \begin{align} \frac{\delta E}{\delta u_h} &= \sum_{k \in outs(h)} \frac{\delta E}{\delta u_k} \frac{\delta u_k}{\delta u_h} \quad(\text{sum rule + chain rule})\\ & = \sum_{k \in outs(h)} -\delta_k \frac{\delta u_k}{\delta u_h} \\ & = \sum_{k \in outs(h)} -\delta_k \frac{\delta u_k}{\delta o_h} \frac{\delta o_h}{\delta u_h} \\ & = \sum_{k \in outs(h)} -\delta_k w_{k,h} \frac{\delta o_h}{\delta u_h} \\ & = \sum_{k \in outs(h)} -\delta_k w_{k,h}\, o_h(1-o_h) \end{align}

$$\delta_h = o_h(1-o_h) \sum_{k \in outs(h)} \delta_k w_{k,h} $$
In [54]:
mod1 <- nnet(classes ~ ., data = spirals, size = 10, trace = FALSE)
plot.nnet(mod1)

Backpropagation algorithm

  • Initialize all weights to small random numbers
  • Until convergence, do:
    • For each training example, do:
      • Forward pass: propagate example, compute outputs
      • For each output unit $k$: $$\delta_k = o_k(1-o_k)(t_k-o_k)$$
      • For each hidden unit $h$: $$\delta_h = o_h(1-o_h)\sum_{k \in outs(h)} \delta_k w_{k,h}$$
      • Update each network weight: $$w_{i,j} = w_{i,j} + \Delta w_{i,j} \quad \text{where} \quad \Delta w_{i,j} = \eta \delta_j x_{i,j}$$
  • You can use batch or stochastic gradient descent
  • There exist efficient techniques using matrix products
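To make the deltas concrete, here is a small Python sketch (the slides use R) that computes $\delta_k$ and $\delta_h$ on a fixed 2-2-1 network for a single example, then checks one resulting gradient against a finite-difference estimate; the inputs and weights are arbitrary made-up values:

```python
import math

def sig(u):
    return 1 / (1 + math.exp(-u))

x, t = [1.0, 0.5], 1.0              # one training example (input, target)
W1 = [[0.1, -0.2], [0.3, 0.4]]      # weights into the 2 hidden units
W2 = [0.5, -0.6]                    # weights into the output unit

def forward(W1, W2):
    h = [sig(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    o = sig(sum(w * hi for w, hi in zip(W2, h)))
    return h, o

# Backprop: output delta, then hidden deltas, then one gradient
h, o = forward(W1, W2)
d_out = o * (1 - o) * (t - o)                                  # delta_k
d_hid = [h[j] * (1 - h[j]) * d_out * W2[j] for j in range(2)]  # delta_h
grad_w = -d_hid[0] * x[0]           # dE/dw for W1[0][0] (note the sign)

# Numerical check of the same partial derivative
eps = 1e-6
def E(W1):
    _, o = forward(W1, W2)
    return 0.5 * (t - o) ** 2

num = (E([[0.1 + eps, -0.2], [0.3, 0.4]])
       - E([[0.1 - eps, -0.2], [0.3, 0.4]])) / (2 * eps)
print(abs(grad_w - num) < 1e-8)  # -> True
```

Checking backprop gradients against finite differences like this is a standard way to catch implementation bugs.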

Properties of backprop

  • Does gradient descent over entire network weight vector
  • Works on any directed acyclic graph (doesn't require layers)
  • Will find a local (not global) optimum
    • Works well in practice, with restarts (epochs)
    • Learning rate will never be optimal everywhere (slow)
  • Speedup by including weight momentum $\alpha$
    • Like a ball accelerating down the hill
    • Also helps avoid local minima $$\Delta w_{i,j}(n) = \eta \delta_j x_{i,j} + \alpha \Delta w_{i,j}(n-1)$$
  • Minimizes error over training example (can still overfit)
  • Training can take thousands of iterations (slow)
  • Using networks after training is very fast
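The momentum update can be sketched in a few lines (illustrative Python, mirroring the earlier R gradient-descent cell on $f(w) = (w-2)^2$; the values of $\eta$ and $\alpha$ are arbitrary):

```python
# Gradient descent with momentum on f(w) = (w - 2)^2, gradient 2*(w - 2)
grad = lambda w: 2 * (w - 2)
eta, alpha = 0.1, 0.9

w, dw = 0.1, 0.0
for _ in range(500):
    dw = -eta * grad(w) + alpha * dw  # Delta w(n) = -eta*grad + alpha*Delta w(n-1)
    w += dw
print(round(w, 4))  # -> 2.0
```

With momentum the iterate overshoots and oscillates around the minimum before settling, which is exactly what lets it roll through shallow local minima.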

Properties of backprop

  • Convergence: avoid local minima
    • Add momentum
    • Stochastic gradient descent
    • Restart with different initial weights
    • Initialize weights near zero: start with near-linear networks
  • Expressiveness
    • Any bounded continuous function can be approximated, with arbitrarily small error, by a network with one hidden layer
      • Linear combination of sigmoid functions
    • Any function can be approximated, with arbitrarily small error, by a network with two hidden layers.

Overfitting

  • If you keep adding hidden units, you can just memorize the input
  • Penalize large weights by complexity penalty $\gamma$ (weight decay): $$ E(w) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{k,d} - o_{k,d})^2 + \gamma \sum_{i,j} w_{j,i}^2$$
  • Train on target slopes as well as values:
    • Penalty on difference between target slope and actual slope: $$ E(w) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{k,d} - o_{k,d})^2 + \mu \sum_{j \in inputs} \left( \frac{\delta t_{k,d}}{\delta x_d^j} - \frac{\delta o_{k,d}}{\delta x_d^j} \right)^2 \right]$$
  • Weight sharing: reuse the same weights across positions in a layer
    • In vision problems: nearby weights should have similar values
    • Convolutional neural networks
  • Early stopping
    • Use a validation set and detect when test error goes up
    • Equivalent to weight decay if you start with small weights
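The effect of the weight-decay penalty can be seen on a single weight (a Python sketch with made-up data; for this one-dimensional case the penalized optimum has the closed form $\sum t x / (\sum x^2 + 2\gamma)$, shrunk toward zero compared to the unpenalized fit):

```python
# Gradient descent on E(w) = 1/2 * sum (t - w*x)^2 + gamma * w^2.
# The penalty contributes 2 * gamma * w to the gradient.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # (x, t), roughly t = 2x
gamma, eta, w = 1.0, 0.02, 0.0
for _ in range(2000):
    g = sum(-(t - w * x) * x for x, t in data) + 2 * gamma * w
    w -= eta * g

closed = sum(t * x for x, t in data) / (sum(x * x for x, _ in data) + 2 * gamma)
print(round(w, 4) == round(closed, 4))  # -> True
```

Setting gamma to 0 recovers the unpenalized least-squares weight, which is noticeably larger.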

Understanding the weights: Autoencoders

  • Output layer must exactly return the input
  • Forced to code the information in fewer bits
  • Sparse autoencoders: use many more nodes in hidden layer, but switch most of them off at any given time

An autoencoder.

Stacked autoencoders

  • Using backprop for deep networks is extremely hard
    • Error signal diffuses until nothing is left
  • Sparse autoencoders can be pre-trained by greedy layer-wise training
    • Freeze weights of the other layers, train one layer at a time
    • Use raw output of previous layers to train subsequent layers
  • A softmax function can be used in the last layer to squash the activation to a probability value (as in logistic regression)

Stacked autoencoders

Stacked sparse autoencoders encode increasingly complex features

  • First layer tends to learn first-order features of the raw input (e.g. edges of image)
  • Second layer learns second-order features (e.g. contours, corners),...

Feature learning with stacked autoencoders

Convolutional neural network

  • For audio, video, images (smooth data)
  • Uses multiple copies of the same neuron
  • Every input neuron only wired together with nearby nodes

Convolutional neural network

In [60]:
mlp = makeLearner("classif.nnet", par.vals = list(size = 200, trace = FALSE))
plotLearnerPrediction(mlp, mnist_task, features = c("pixel217", "pixel518")) + theme_cowplot()

Questions?

Acknowledgements

Thanks to Bernd Bischl and Tobias Glasmachers for useful input.
